In [ ]:
%%HTML
<style>
.container { width:100% }
</style>
We need to read our data from a csv file. The module csv
offers a number of functions for reading and writing a csv file.
In [ ]:
import csv
The data we want to read is contained in the CSV file 'cars.csv'. In this file, the first column has the miles per gallon, while the engine displacement is given in the third column. We convert miles per gallon into kilometres per litre (1 mile = 1.60934 kilometres, 1 gallon = 3.78541 litres) and cubic inches into litres (1 cubic inch = 0.0163871 litres).
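The two conversions can be written as small helper functions. This is only a sketch: the function names and the sample inputs (20 miles per gallon, 300 cubic inches) are made up for illustration, while the conversion factors are the ones quoted above.

```python
# Conversion factors as quoted in the text.
KM_PER_MILE = 1.60934
LITRES_PER_GALLON = 3.78541
LITRES_PER_CUBIC_INCH = 0.0163871

def mpg_to_kpl(mpg):
    """Convert miles per gallon into kilometres per litre."""
    return mpg * KM_PER_MILE / LITRES_PER_GALLON

def cubic_inches_to_litres(cubic_inches):
    """Convert a volume in cubic inches into litres."""
    return cubic_inches * LITRES_PER_CUBIC_INCH

# Hypothetical sample values, not taken from cars.csv.
print(mpg_to_kpl(20.0))
print(cubic_inches_to_litres(300.0))
```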
In [ ]:
with open('cars.csv') as cars_file:
    reader = csv.reader(cars_file, delimiter=',')
    line_count = 0
    kpl = []
    displacement = []
    for row in reader:
        if line_count != 0:  # skip header of file
            # miles per gallon is in first column
            kpl.append(float(row[0]) * 1.60934 / 3.78541)
            # engine displacement is in third column
            displacement.append(float(row[2]) * 0.0163871)
        line_count += 1
print(f'{line_count} lines read')
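An alternative to counting lines is to consume the header row with `next` before entering the loop. The sketch below assumes nothing about the real file: the helper name `read_two_columns`, the header names, and the two data rows are hypothetical and only serve to make the example self-contained.

```python
import csv
import io

def read_two_columns(csv_file, first=0, second=2):
    """Return two lists of floats taken from the given column indices."""
    reader = csv.reader(csv_file, delimiter=',')
    next(reader, None)  # skip the header row; does nothing on an empty file
    col_a, col_b = [], []
    for row in reader:
        col_a.append(float(row[first]))
        col_b.append(float(row[second]))
    return col_a, col_b

# Hypothetical in-memory sample; the header names and rows are invented.
sample = io.StringIO("mpg,cyl,displacement\n18.0,8,307.0\n15.0,8,350.0\n")
mpg, cubic_inches = read_two_columns(sample)
print(mpg, cubic_inches)
```

The same function would accept the file object returned by `open('cars.csv')`, since `csv.reader` works on any iterable of lines.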
Now kpl is a list of floating point numbers specifying the fuel efficiency in kilometres per litre, while the list displacement contains the corresponding engine displacements measured in litres.
In [ ]:
kpl[:5]
The fuel consumption is the inverse of the variable kpl. The variable lph gives the number of litres needed to drive 100 kilometres.
In [ ]:
lph = [ 100 / x for x in kpl]
In [ ]:
lph[:5]
Yes, these old American cars had a terrible fuel efficiency. But a look at the engine displacements gives us a clue about what is going on.
In [ ]:
displacement[:5]
The number of data pairs of the form $\langle x, y \rangle$ that we have read is stored in the variable m.
In [ ]:
m = len(displacement)
m
In order to plot the fuel consumption versus the engine displacement, we turn the lists displacement and lph into numpy arrays.
In [ ]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
In [ ]:
X = np.array(displacement)
Y = np.array(lph)
In [ ]:
plt.figure(figsize=(12, 12))
sns.set(style='whitegrid')
plt.scatter(X, Y, c='b')
plt.xlabel('engine displacement in litres')
plt.ylabel('litre per 100 km')
plt.title('Fuel Consumption Versus Engine Displacement')
Next, we want to show how linear regression can be formulated as a minimization problem and how this minimization problem can be solved using TensorFlow.
In [ ]:
import tensorflow as tf
This example differs from our first example, as this time the function that we want to minimize depends on a set of training data. Therefore, we have to define placeholders to insert our data into TensorFlow. We define a placeholder for the independent variable displacement and a placeholder for the dependent variable lph.
As we do not want to hardwire the number of examples, we set the shape of these placeholders to (None,).
In [ ]:
X_ph = tf.placeholder(tf.float32, shape=(None,))
Y_ph = tf.placeholder(tf.float32, shape=(None,))
We have a linear model to predict the fuel consumption from the displacement. This linear model is as follows: $$ Y = \vartheta \cdot X $$ Here $X$ is the engine displacement, while $Y$ is the fuel consumption. Note that this linear model does not include a bias. The reason is that this bias should be $0$ as a car without an engine won't use any fuel.
A first guess for $\vartheta$ would be the average fuel consumption divided by the average engine displacement:
In [ ]:
theta_initial = np.mean(Y) / np.mean(X)
theta_initial
$\vartheta$ is the variable that we want to find. Hence we declare it as a TensorFlow Variable.
In [ ]:
ϑ = tf.Variable(theta_initial, dtype=tf.float32)
The loss function is defined as the sum of the squares of the errors. In order to normalize the loss, we divide it by the number of training examples $m$. $$ \texttt{loss} := \frac{1}{m} \cdot \sum\limits_{i=1}^m \bigl(\vartheta \cdot x_i - y_i\bigr)^2 $$ Here $x_i$ is the engine displacement of the $i$-th training example, while $y_i$ is the fuel consumption of this training example. Our goal is to determine the value of $\vartheta$ that minimizes this loss function.
The function square takes an array and squares it elementwise. The function reduce_sum computes the sum of all elements of an array.
In [ ]:
loss = tf.reduce_sum(tf.square(ϑ * X_ph - Y_ph)) / m
loss
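Because this loss is a quadratic function of $\vartheta$, its minimum can also be written down directly: setting the derivative $\frac{2}{m} \sum_{i=1}^m (\vartheta \cdot x_i - y_i) \cdot x_i$ to zero gives $\vartheta = \sum_i x_i y_i \,/\, \sum_i x_i^2$. The following NumPy sketch computes the same loss and this closed-form value, which can serve as a sanity check for the result of gradient descent. The function names and the toy data are assumptions made for illustration; they are not part of the notebook or of the car data.

```python
import numpy as np

def mse_loss(theta, x, y):
    """Mean of the squared errors of the model y = theta * x."""
    return np.sum((theta * x - y) ** 2) / len(x)

def closed_form_theta(x, y):
    """Value of theta at which the derivative of mse_loss vanishes."""
    return np.sum(x * y) / np.sum(x * x)

# Hypothetical toy data, roughly following y = 3 * x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.1, 5.9, 9.2])

theta_star = closed_form_theta(x, y)
print(theta_star, mse_loss(theta_star, x, y))
```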
We will use gradient descent to minimize our loss function. After some experimentation, I have chosen a learning rate $\alpha$ of $0.03$:
In [ ]:
α = 0.03
train = tf.train.GradientDescentOptimizer(α)
optimizer = train.minimize(loss)
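To see what a single step of gradient descent does for this loss, the update can be written out by hand: the gradient of the loss with respect to $\vartheta$ is $\frac{2}{m} \sum_{i=1}^m (\vartheta \cdot x_i - y_i) \cdot x_i$, and each step subtracts $\alpha$ times this gradient from $\vartheta$. The NumPy sketch below follows that rule. It is an illustration under stated assumptions, not a description of TensorFlow's internals: the function name, the toy data, the starting value, and the number of steps are all made up.

```python
import numpy as np

def gradient_descent(x, y, theta, alpha, steps):
    """Minimize mean((theta * x - y)**2) by plain gradient descent."""
    m = len(x)
    for _ in range(steps):
        # derivative of the loss with respect to theta
        grad = 2.0 / m * np.sum((theta * x - y) * x)
        theta = theta - alpha * grad
    return theta

# Hypothetical toy data that satisfies y = 3 * x exactly.
x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 6.0, 9.0])

theta = gradient_descent(x, y, theta=0.0, alpha=0.03, steps=200)
print(theta)
```

For this toy data the error in $\vartheta$ shrinks by a constant factor in every step, so the iteration converges towards 3. If the learning rate is too large, the same update overshoots and diverges, which is why the value of $\alpha$ has to be chosen with some care.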
Finally, we can start a TensorFlow session and run our optimizer for 9 steps of gradient descent. Observe how we have used the dictionary data_dict to feed the training data into our optimizer.
In [ ]:
init = tf.global_variables_initializer()
with tf.Session() as s:
    s.run(init)
    data_dict = {X_ph: X, Y_ph: Y}
    for k in range(9):
        s.run(optimizer, data_dict)  # one step of gradient descent
        theta, l = s.run([ϑ, loss], data_dict)  # evaluate the variable ϑ and the loss function
        print('%2d: ϑ = %f, loss = %f' % (k, theta, l))
We can conclude: For a car from the seventies or early eighties that has an engine displacement of $d$ litres, the fuel consumption is about $3.18 \cdot d$ litres per 100 kilometres.
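As a small usage example, the fitted model can be turned into a prediction function. The slope 3.18 is the value stated above; the function name and the 5-litre engine are hypothetical and only chosen for illustration.

```python
THETA = 3.18  # slope reported in the text, in litres per 100 km per litre of displacement

def predicted_consumption(displacement_litres, theta=THETA):
    """Predicted fuel consumption in litres per 100 km for the model Y = theta * X."""
    return theta * displacement_litres

# Hypothetical engine with a displacement of 5 litres.
print(predicted_consumption(5.0))
```

For such an engine the model predicts a consumption of roughly 15.9 litres per 100 kilometres.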
If we compare this notebook to the notebook Simple-Linear-Regression.ipynb that we developed at the beginning of this lecture, we notice the following: In Simple-Linear-Regression.ipynb we had to derive a formula to compute the minimum of the loss function. In this notebook, we only had to specify the loss function, and gradient descent computed the minimum numerically.
Finally, we plot the results.
In [ ]:
xMax = max(X) + 0.2
plt.figure(figsize=(12, 10))
sns.set(style='darkgrid')
plt.scatter(X, Y, c='b')
plt.plot([0, xMax], [0, theta * xMax], c='r')
plt.xlabel('engine displacement in litres')
plt.ylabel('fuel consumption in litres per 100 km')
plt.title('Fuel Consumption versus Engine Displacement')